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CONTOUR REGRESSION: A GENERAL APPROACH TO 
DIMENSION REDUCTION 1 

By Bing Li, Hongyuan Zha and Francesca Chiaromonte 

Pennsylvania State University 

We propose a novel approach to sufficient dimension reduction in 
regression, based on estimating contour directions of small variation 
in the response. These directions span the orthogonal complement of 
the minimal space relevant for the regression and can be extracted 
according to two measures of variation in the response, leading to sim- 
ple and general contour regression (SCR and GCR) methodology. In 
comparison with existing sufficient dimension reduction techniques, 
this contour-based methodology guarantees exhaustive estimation of 
the central subspace under ellipticity of the predictor distribution 
and mild additional assumptions, while maintaining y^-consistency 
and computational ease. Moreover, it proves robust to departures 
from ellipticity. We establish population properties for both SCR and 
GCR, and asymptotic properties for SCR. Simulations to compare 
performance with that of standard techniques such as ordinary least 
squares, sliced inverse regression, principal Hessian directions and 
sliced average variance estimation confirm the advantages anticipated 
by the theoretical analyses. We demonstrate the use of contour-based 
methods on a data set concerning soil evaporation. 

1. Introduction and background. Consider the regression of a response Y 
on a vector of continuous predictors X = (X\, . . . ,X p ) T S MP. Sufficient di- 
mension reduction is a body of theory and methods for reducing the di- 
mension of X while preserving information on the regression, that is, on 
the conditional distribution of Y\X (see [7, 15, 16]). A dimension reduction 
subspace [3, 4] is defined as the column space of any p x d (d<p) matrix n 
such that 

(1) Y±X\r] T X, 
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where JL indicates independence. Thus, conditioning on 77 X, Y and X are 
independent, or equivalently, the conditional distribution of Y\X equals that 
of Y\rj T X. Because relation (1) is unaffected by multiplying 77 from the right 
by a nonsingular matrix, what matters is the column space of 77 rather than 
its specific form. Also note that there can be many subspaces satisfying (1), 
because if it holds for 77, then it also holds for any other matrix whose column 
space includes that of 77. Naturally, we are interested in the subspace with the 
minimal dimension. Under mild conditions that are almost always verified 
in practice, the minimal subspace is uniquely defined and coincides with the 
intersection of all subspaces satisfying (1) (see [1, 4]). This intersection is 
called the central subspace, denoted by S Y \xi an d its dimension is called the 
structural dimension, denoted by q. 

The central subspace can be estimated without estimating a response 
surface, and without strong assumptions on the form of the dependence 
between Y and X. Well-known estimation methods include ordinary least 
squares (OLS, [17]), sliced inverse regression (SIR, [15]; see also [9]), prin- 
cipal Hessian directions (PHD, [16]) and sliced average variance estimation 
(SAVE, [7]). These methods constitute effective premodeling tools to reduce 
high-dimensional regressions to equivalent ones comprising only a few linear 
combinations of the original predictors. Such a reduction greatly facilitates 
model building, as well as the use of nonparametric techniques. Dimension 
reduction methods also provide a comprehensive visualization of the data 
whenever the estimated structural dimension is 1, 2 or possibly 3, which is 
the case in a vast majority of practical applications. In this sense, sufficient 
dimension reduction provides a foundation for regression graphics, as argued 
in [4] and [1]. 

In many studies attention is restricted to the location component of the 
dependence between Y and X , that is, to the regression function E(Y\X). 
The central mean subspace, Se(y\X)i was introduced by Cook and Li [5]. 
Because the conditional mean E(Y\X) is determined by the distribution of 
Y\X, the central mean subspace is always contained in the central subspace. 
Cook and Li investigated the above mentioned methods in relation to their 
ability to estimate directions within the Se(y\x)j an d proposed alternative 
methods to target this subspace directly. See also [14]. 

The above methods enjoy the advantage of being computationally inex- 
pensive and -y/n-consistent regardless of the original predictor dimension p 
and the structural dimension q, thus avoiding the "curse of dimensionality" 
often affecting nonparametric techniques. The -y/n-consistency is achieved 
because these methods exploit global features of the dependence of Y on 
X, in the sense they involve averaging among fixed portions of the data, 
regardless of the sample size. For instance, OLS employs sample moments, 
and SIR involves averaging predictors within slices of Y , where the size of 
each slice need not go to zero as n — > 00 . 
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The above methods also have common limitations, however. First, they 
require linear conditional means among predictors ([4], page 57), which is 
often imposed by assuming ellipticity of the distribution of X. When this 
condition fails, the estimators may converge to directions outside Sy\x- Sec- 
ond, even when linearity holds, OLS, PHD and SIR are not guaranteed to 
be exhaustive: they may converge at the -y/n-rate to a set of vectors that 
are in Sy\x but do not span Sy\x- This lack of exhaustiveness is arguably 
one of the most important shortcomings of these methods. An instance is 
the heavy reliance of methods such as OLS and SIR on monotone trends in 
the dependence of Y on X. For example, if Y = (f3 T X) 2 + ae with j3 6 W 
and X ~ N(0,I p ), OLS and SIR will estimate and therefore fail to de- 
tect (3 itself. Based on early results obtained by Peters, Redner and Decell 
[19] in the special context of feature extraction, it can be shown that SAVE 
is indeed exhaustive if X\Y is normally distributed. However, as we will 
see in Section 3, this assumption is very restrictive in the regression context. 
Thus, it is of both practical and theoretical significance to pursue exhaustive 
estimation under reasonably general sufficient conditions. 

At the opposite end of the spectrum are adaptive methods that exploit 
local features of the dependence of Y on X [13, 23]. The strength of these 
methods is that they require much weaker assumptions (virtually none) on 
the distribution X. However, because they employ multivariate kernels that 
shrink with the sample size, their convergence rates are generally slower than 
y/n. In addition, they are computationally intensive, as they iterate between 
nonparametric estimation of a multivariate unknown function and numerical 
maximization of the estimated function over a potentially high-dimensional 
space. 

Here we propose a novel approach that targets contour directions, that is, 
directions along which the response surface is flat. Since contour directions 
span the orthogonal complement of the central subspace, estimating the 
former is equivalent to estimating the latter. We propose to extract contour 
directions according to two measures of variation in the response, leading to 
two methods: simple and general contour regression (SCR and GCR). Unlike 
traditional global methods such as OLS, SIR, PHD and SAVE, contour 
regression guarantees exhaustive estimation of the central subspace under 
ellipticity of X and very mild additional assumptions. It also proves robust 
to violations of ellipticity. At the same time, unlike local methods, contour 
regression achieves -y/n-consistency regardless of the dimensions p and q, and 
it is computationally inexpensive. 

The remainder of the paper is organized as follows: Section 2 concerns 
population-level properties and asymptotic properties of SCR, and Section 3 
presents sufficient conditions for exhaustiveness of SCR. Section 4 concerns 
population-level properties of GCR. Section 5 discusses the robustness of 
GCR against violations of predictor ellipticity. Section 6 presents simulations 
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comparing the performance of SCR and GCR with that of OLS, SIR, PHD 
and SAVE. Section 7 reports the analysis of a data set. Section 8 contains 
final remarks. The main proofs are reported in the body of the paper, but 
some technical details are relegated to the Appendix. 

2. Simple contour regression. 

2.1. Basic concepts. Let (Xi,Y{), . . . , (X n ,Y n ) be independent copies of 
the random pair (X, Y), where and Y E R, let Fxy be the joint dis- 

tribution of (X,Y), and let F n be the corresponding empirical distribution. 
We will be concerned with matrix- valued estimators of the form T(F n ). If 
the columns of T(Fxy) belong to Sy\x> then we say that T(F n ) is unbi- 
ased at the population level. If the columns of T(Fxy) actually span S Y \x ; 
then we say that T(F n ) is exhaustive at the population level. If T(F n ) con- 
verges at the -^/n-rate to T(Fxy) m the first case, then we say that it is 
■y/n-consistent. If the -^/n-convergence holds in the second case, then we say 
that T(F n ) is -^/n-exhaustive. In this section we will introduce simple con- 
tour regression and establish its exhaustiveness at the population level, as 
well as its y^R-exhaustiveness. 

To illustrate the basic intuition underlying contour regression, it is help- 
ful to make an analogy between the empirical directions employed in this 
approach and empirical distributions. In defining an empirical distribution 
we put an equal probability mass at each observed point, based on the ra- 
tionale that all the information about the random vector X carried by the 
random sample X\ , X n is captured by the positions of the observations 
themselves. With a similar rationale, it can be argued that all the directional 
information about X carried by the data is captured by the empirical di- 
rections (Xi — Xj), i ^ j, i,j = 1, . . . ,n. Since dimension reduction is about 
finding a specific set of directions — those along which the conditional dis- 
tribution of Y\X genuinely depends on X — it is natural to focus on these 
(2) empirical directions. Roughly, contour regression extracts a subset of the 
empirical directions characterized by having small variation in Y, and then 
performs a principal component analysis on the extracted directions. Since 
contour directions form the orthogonal complement of the central subspace, 
an estimator of the latter is obtained from the components with smallest 
eigenvalues. 

It is also important to remark that many global methods (see Introduc- 
tion) gather directional information by slicing the response and processing 
predictor observations separately within each slice. This way, interslice di- 
rectional information relevant to the regression, if any, is lost. On the other 
hand, empirical directions (Xi — Xj) can "cut across" response slices, thus 
allowing contour regression to exploit interslice information. 

For ease of exposition we will first introduce population-level quantities 
and then construct estimators by analogy. 
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2.2. Population-level exhaustiveness. Let (X,Y) be an independent copy 
of (X,Y) and suppose that the central subspace Sy\x f° r the regression of 
Y on X is spanned by the column space of a p x q matrix (3 with q < p. 
Consider the matrix 

K{c) = E[(X - X)(X - X) T | \Y - Y\ < c). 

We will show that the eigenvectors of K{c) corresponding to its smallest q 
eigenvalues span the central subspace. For this purpose we need the following 
assumption. 

Assumption 2.1. For any choice of vectors v E Sy\x an d w E (Sy\x) ± 
such that \\v\\ = \\w\\ = 1, and some constant c> 0, we have 

(2) var[w T (X - X) \ \Y - Y\ < c] > var[v T (X - X) \ \Y - Y\ < c}. 

This assumption is a reasonable one: because the conditional distribution 
of Y\X depends on v T X but not on w T X, we expect Y to vary more with 
v T X than it does with w T X. Hence, intuitively, within the same increment 
of Y, w T X should vary more than v T X does. In Section 3 we will prove 
that Assumption 2.1 holds under fairly general conditions. Here we give a 
few examples to illustrate its wide applicability. For reasons that will become 
clear later on, we will always impose this assumption on the standardized 
predictors with mean and variance matrix I p . 

Example 2.1. Suppose X = (X l ,X 2 ) T ~ N(0,I 2 ) and Y = X\ + ere, 
with e JL X and e ~ N(0, 1). For this regression Sy\x is the one-dimensional 
span of (5 = (0, l) r , and the conditional variances on the left- and right-hand 
sides of (2) are, respectively, 

X 1 = E[(X 1 -X 1 ) 2 \\Y -Y\<c] and A 2 = E[(X 2 - X 2 f \ \Y - Y\ < c]. 

Because X\ is independent of X 2 and e, it is independent of Y. Therefore the 
first conditional expectation equals the unconditional expectation E(X\ — 
X\) 2 , which is 2. We have computed the second conditional expectation 
numerically on a grid of values c = 0.1, 0.5, 1, . . . , 3 and a = 0.1, 0.2, 0.3, . . . , 2. 
In all cases A2 < Ai = 2. Below is the tabulation corresponding to a = 0.3 
and selected values of c: 

c 0.10 0.50 1.00 1.50 2.00 2.50 3.00 
A 2 0.85 0.90 1.04 1.20 1.34 1.45 1.54 

We note that this is the case where OLS and SIR will estimate zero, providing 
no information about the central subspace. 
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Example 2.2. Let X = (Xi,X 2 ) T and e be as in Example 2.1, and 
Y = {X.2 — l) 3 + ere. We again used numerical integration to compute A2 
for c = 0.1, 0.5, 1, . . . , 3 and a = 0.1, 0.2, 0.3, . . . , 2, and again obtained values 
below 2 in all cases. A table for a = 0.3 and selected values of c follows: 

c 0.10 0.50 1.00 1.50 2.00 2.50 3.00 
A 2 0.24 0.28 0.36 0.45 0.53 0.61 0.67 

We checked numerically numerous other response functions such as poly- 
nomials, exponential and logarithmic functions, trigonometric functions and 
so on, never encountering a violation of Assumption 2.1. In the next exam- 
ple, we verify Assumption 2.1 for a regression with a binary response. This 
example also helps to emphasize that Assumption 2.1 should be imposed on 
the standardized, rather than the original, predictor. 

Example 2.3. Let Y ~ Bernoulli(l/2) and X £ M 2 . The regression is de- 
scribed through the inverse conditional distribution: X\(Y = y) ~ N(n y ,l2), 
with fio = (0, — 1) T and [i\ = (0, 1) T . Thus, here Sy\x is again the one- 
dimensional span of = (0, 1) T and the conditional variances on the left- 
and right-hand sides of Assumption 2.1 are Ai and A2 as defined in the 
previous examples. Simple calculations show that, for < c < 1, 

E[(X -X)(X- X) T \Y = Y] = E[(X -X)(X- X) T \Y = Y = 0] 

= 2E(XX T \Y = 0) - 2E(X\Y = 0)E(X T \Y = 0) 

= 2var(A|y = 0) =2I 2 . 

Therefore, Ai = A2 = 2, and Assumption 2.1 fails. However, if we standardize 
the predictor vector at the outset, the assumption holds also for this binary 
regression. Note that E(X) = and 

S = var(X)=(; °). 

It follows that 

E[(Z - Z)(Z - Z) T \Y = Y] 

= 5r 1/2 £[(x -x){x- x) t \y = y]xr 1/2 = 2s" 1 . 

Because E is diagonal, the central subspace S Y \z for the regression of Y on 
the standardized predictor Z = Yi~~ l / 2 X is still the span of (0, 1) T (see, e.g., 
[4], page 106). Thus, on the Z-scale Ai = 1 < 2 = A2. 

In the next theorem and corollary we prove that if X is elliptically con- 
toured and Assumption 2.1 holds, then the population vectors from SCR 
exhaust the central subspace Sy\x- We first consider the standardized X. 
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Theorem 2.1. Suppose that X has an elliptical distribution with E(X) = 
and var(X) = I p . If Assumption 2.1 holds, then the eigenvectors of K(c) cor- 
responding to its smallest q eigenvalues span the central subspace Sy\x- 

Proof. Suppose, without loss of generality, that the q columns of (5 
form an orthonormal set in MP. Let 71, ... , Jp-q be an orthonormal basis for 
(Sy\x)' L - We need to show that (i) 71, . . . , 7 p - q are eigenvectors of K(c), and 
(ii) their corresponding eigenvalues are the largest among its eigenvalues. If 
(i) and (ii) hold, then the eigenvectors of K(c) corresponding to its smallest 
q eigenvalues will coincide with span(/3) =Sy\x> as desired. 

(i) Let B = (71, . . . , 7 P _ 9 , ft, . . . ,p q ) and W = B T X. By Proposition 6.3 
of [4], page 106, the central subspace for the regression of Y on W, Sy\w> ls 
spanned by the vectors B~ l f3\, . . . ,B~ l (5 q . Note that -B _1 /3j = e p - q+ i, that 
is, the vector in MP with a 1 in the (p — q + z)th position and 0's in all 
remaining ones. By construction W still has a spherical distribution with 
mean and variance I p . Next, let (X,Y) be an independent copy of (X,Y), 
W = B T X, and 

Kx{c) = E[(W - W){W - W) T I \Y - Y\ < c). 

It is easy to see that K\(c) = B~ l K(c)B, and hence that 7, is an eigenvector 
of K{c) if and only if B T -)i is an eigenvector of K\{c). Thus, it suffices to 
show that e\ , . . . , e p - q are eigenvectors of K\ (c) . 

Let 4>i : MP 1— > MP, i = 1, . . . ,p — q, be the map that changes the sign of the 
ith element of a vector in M p [e.g., </>i(x) = (—x\,X2, ■ ■ ■ ,x p ) T }. Observe that 
<j>i is an invertible mapping with <j>~ 1 = (pi, and that x, 4>i{x), . . . , <f) p - q (x) are 
on the same sphere centered at the origin. Let C be any measurable set. 
Then 

Pr(W eC) = Pt(W G <j>i{C)) = Vr{^ l {W) G C) = Pt(</h(W) G C), 

where the first equality follows because W has a spherical distribution and 
x>— > (f>i{x) is an orthogonal transformation, and the third equality follows 
because fa = cj)~ l . Thus W and fa(W) have the same distribution. Further- 
more, because of the conditional independence, 

Y A. (Wi, . . . ,W p )\Wp-q+i, . . . ,W P , 

the conditional distributions of Y\W and y|<^j(VF) (where i = 1, . . . ,p — q) 
are both equal to the conditional distribution of Y"| Wp_ 9+ i, . . . , W p . It follows 
that iW,Y) and (fa(W),Y) have the same distribution. Consequently, 

K^c) = E[{W - W){W - W) T \\Y-Y\<c] 

= E[(fa{W) - fa(W))(fa{W) - fa(W)f I |F-y|< c]. 
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So the matrix K\(c) can be re-expressed as the average of the right-hand 
sides of the first and the second lines above; that is, 

E[^(W — W)(W — W) T + ^(<pi(W) — cf)i(W))((fti(W) — <fti(W)) T | — < c]. 

Denote the above matrix by E(Ai \ \Y — Y\ < c) in the obvious way. Then it 
is easy to see that the (i, i)th element of is (Wi — Wi) 2 and the (i, j)th and 
(j, i)th elements are identically whenever j ^ i. Hence, for i = 1, . . . ,p — q, 
6{ is an eigenvector of K\(c) corresponding to the eigenvalue Aj = E((Wi — 
Wi) 2 \\Y-Y\ <c). 

(ii) From (i) it follows that 7i,---,7 p - g are eigenvectors of K(c) corre- 
sponding to the eigenvalues Ai, . . . , A p _ g . Let 7 p _ g+ i, . . . ,7 P be the other q 
eigenvectors of K(c), corresponding to the eigenvalues A p _ g+ i, . . . , X p . By 
construction these q vectors span Sy\x- The eigenvalues Aj, i = 1, . . . ,p, can 
be expressed as 

^ = 7j T ^(c)7i = var[ 7 f(X - X) \ \Y - Y\ < c]. 

But because 71, . . . ,7 P -g are vectors in (S Y \x)' L an d 7^-5+1, • • • ,7 P are vec- 
tors in Sy\xi Assumption 2.1 suffices to ensure that Ai,...,A p _ g are the 
largest eigenvalues of K(c). □ 

Let us now turn to X with an arbitrary elliptically contoured distribu- 
tion having mean /i and nonsingular variance matrix E. Theorem 2.1 asserts 
that if we let Z = Y,~ l l 2 {X — /i), then the eigenvectors 7 p _q + i, . . . , 7 P corre- 
sponding to the smallest q eigenvalues of the matrix E[(Z — Z)(Z — Z) T \ 
\Y — Y\ < c] span Sy\z- Consequently, by Proposition 6.3 of [4], page 106, 
the vectors S~ 1 / 2 7 P _ g+ i, . . . , S~ 1 / 2 7 P span Sy\x- Thus we have the following 
generalization of Theorem 2.1: 

Corollary 2.1. Suppose that X has an elliptically contoured distribu- 
tion with mean fi and nonsingular variance matrix T,. Suppose that Assump- 
tion 2.1 holds with X and X replaced by £ -1 / 2 (X — fi) and £ -1 / 2 (X — //). 
Let 7p_q+i, . . . ,7 P be the eigenvectors of the matrix 

E~ 1/2 E[(X -X)(X- X) T \\Y-Y\< c]T,- 1/2 

corresponding to its smallest q eigenvalues. Then the vectors S~ 1 / 2 7 p _ g+ i, . . . , X 
span Sy\x- 

Note that since the corollary postulates Assumption 2.1 on the standard- 
ized predictor, it does also apply to regressions with discrete responses, such 
as the one described in Example 2.3. 
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2.3. Estimation and ^fn-exhaustiveness. We now construct a sample es- 
timate of the matrix K{c). As before let (X ,Y) be an independent copy of 
(X,Y), and consider the matrix 

(3) H{c) = E[(X - X)(X - X) T I{\Y -Y\< c)]. 

Since K{c) and H (c) differ only by the proportionality constant Pr(|Y — Y \ < c 
their eigenvectors coincide. We will thus consider an estimate of H{c) for 
simplicity. Let (Xi, Y\ ),..., (X n , Y n ) be an independent sample from the 
random pair (X,Y). The estimating procedure will mimic the theoretical 
development in Section 2.2: 

(a) Compute sample mean and variance matrix of the predictor X: 

n n 

1=1 1=1 

(b) Compute the matrix- valued [/-statistic: 

(4) H{c) = ±- (Xf-XiXXf-XifldYj-Yll^c), 

w {i,j)€N 

where iV is the index set {(z, j) :i = 2, . . . ,n; j = 1, . . . ,i — 1}. 

(c) Compute the spectral decomposition of S~ 1 / 2 i?(c)S -1 / 2 and let 7 p+g _i, 
be the eigenvectors corresponding to the smallest q eigenvalues. 

(d) The span of these eigenvectors estimates Sy\Zi where Z is the stan- 
dardized version of X. Thus, our estimate of the central subspace is 

S Y \x = span(S- 1/2 7 p _ 9+ i , . . . , Xr 1/2 7 p ) . 

In practice, as for other sufficient dimension reduction methods, a testing 
procedure will be needed to determine statistically how many eigenvectors 
to take. Here we assume the dimension q of the central subspace to be 
known, leaving the development of an appropriate testing procedure for 
future investigation. However, in the next section we lay the groundwork for 
constructing such a procedure by further exploring the eigenvalue structure 
of the matrix K(c). Before doing so, we demonstrate the -y/n-exhaustiveness 
of SCR. 

Theorem 2.2. Suppose that S is nonsingular and that the components 
of X have finite fourth moments. Then 

Proof. Since X has finite fourth moment, E is a -\/n-consistent esti- 
mator of E by the central limit theorem. Since E is nonsingular, we have 
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that £ l / 2 is a -y/n-consistent estimator of £ 1 I 2 by the continuous mapping 
theorem. It follows that 

= ^-^(Hic) - H(c))^- 1/2 + Opin- 1 / 2 ). 

Next, let vec(-) be the operator that stacks the columns of a matrix [for 
A with columns a±, . . . , a&, vec(A) = (aj , . . . , a^) T ), and let vec T (-) denote 
the transpose of vec(-). Note that H(c) is a matrix-valued [/-statistic with 
elements having finite second moments. By the central limit theorem for 
U statistics ([20], Chapter 5) it easily follows that T,/nvec(H(c) — H(c)) 
converges in distribution to a p 2 -dimensional multivariate normal vector 
with mean and variance matrix 

4£[vec(#i(A^Y;c))vec T (#i(X,Y;c))] - 4vec(i/(c))vec T (#(c)), 

where H\(X, Y;c) is the conditional expectation 

E[{X - X){X - X) T I(\Y - Y\ < c)\X,Y}. 

Consequently H(c) — H(c) = O^l/y^), which completes the proof. □ 

As a consequence of this theorem, 7 p _q_|_i, . . . , 7 P provide a y'n-exhaustive 
estimator of Sy\z, and hence S _1 / 2 7 p _ g+ i, . . . , S -1 / 2 7 p provide a -y/n-exhaustive 
estimator of Sy\x ■ 

The role played by the constant c here is similar to the width of a slice in 
SIR and SAVE. It differs from the width of a kernel in a typical nonpara- 
metric estimator in that it need not go to as n — > oo. The thresholding 
can actually be implemented in two ways: fixing a numerical value for c, 
or fixing a proportion of empirical directions Xi — Xj [out of Q)]. This 
distinction is relevant for theoretical analysis and simulation. We find that 
using the 5-15% empirical directions ranking lowest in terms of response 
absolute difference works well in simulation studies (see Section 6). A more 
careful investigation of the thresholding rule is important, but goes beyond 
the scope of the present article — we expect good thresholding to depend on 
the dimension p, q, and possibly other factors. 

The asymptotic analyses we present here are all carried out for the thresh- 
olding based on a fixed value of c. However, they can be easily paralleled for 
thresholding based on a fixed proportion: Modulo the fact that c need not go 
to as n — > oo, the comparison of these two thresholding options is analogous 
to that between a kernel and a nearest-neighbor estimator [18, 21, 22]. 
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2.4. Toward testing hypothesis. In this section we will show that, under 
an additional assumption, the largest p — q eigenvalues of E _1 / 2 i ; C(c)S~ 1 / 2 
are identically 2, so that 71, . . . ,7p— g are the eigenvectors of 2I p — Yi~ l l 2 K{c)Yi~ 1 / 2 
corresponding to eigenvalues equal to 0. This paves the way for construct- 
ing a test statistic to determine the dimension of Sy\z (and hence Sy\x)i 
because the problem now is converted into testing how many of the small- 
est eigenvalues of 2I p — S _1//2 ET(c)S _1//2 are 0. Tests of this type can be 
constructed using the asymptotic distribution of small singular values de- 
veloped by Eaton and Tyler [10]. Similar tests have been constructed in 
other contexts in [15, 16]; [4], Chapter 11 and [2, 5, 14]. We expect that 
a test statistic and related sampling distribution for SCR can be obtained 
analogously. The additional assumption is usually referred to as the constant 
conditional variance assumption, and is often evoked when developing such 
tests: 

Assumption 2.2. If j3 is a matrix whose columns form a basis in Sy\x> 
then the conditional variance var(X\(3 T X) is a nonrandom matrix. 

The next lemma states various implications of conditional independence, 
which will be used in the subsequent development. 

Lemma 2.1. (a)IfVx,...,Ve are random vectors satisfying (V\, V2, V3) _L 
-L (V A ,V 5 ,V 6 ), ViJLV 2 |V 3 andV 4 JLV 5 \V 6) then (Vi, Vi) JL (V 2 , V 5 )\(V 3 , V 6 ). 

(b) If Vi, . . . , V4 are random vectors satisfying (Vi, V2) JL (V3, V4), then 
V 1 ±V 3 \(V 2 ,V 4 ). 

(c) If Vi,V2,Vs are random vectors satisfying (Vi, V 2 ) JL V3, then V\ J_ 

J- v 3 \v 2 . 

Part (c) is a special case of a well-known result; see, for example, [8] and 
[4], Proposition 4.6. The proofs of (a) and (b) are similar to those used in 
these papers, and are given in the Appendix. 

Theorem 2.3. Suppose that X has an elliptical distribution and that 
Assumptions 2.1 and 2.2 hold. Let q be the dimension of Sy\x- Then the 
p — q largest eigenvalues of S" 1 / 2 i('(c)S~ 1 / 2 are identically 2. 

Proof. Let (5 be a p x q matrix whose columns span Sy\x, and Z = 
E -1 / 2 (X — (u). Then the columns of 77 = S 1 / 2 /? span S Y \ Z and ^"^^(c)^" 1 / 2 
is the matrix 

K x (c) = E[{Z - Z){Z - Z) T I \Y - Y\ < c] = E(A 1 Af | |A 2 | < c), 
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where we have abbreviated Z — Z and Y — Y by Ai and A2, respectively. 
Because {Z, Y, rf Z) JL (Z, Y, rj T Z), ZJLY\rj T Z and Z±Y\rj T Z, we have, 
by Lemma 2.1(a), that (Z,Z) JL (Y, Y)\rj T Z, rj T Z. This in turn implies that 
Ai JL A2\rj T Z, r] T Z . Consequently, 

£?(AiAf I |A 2 | < c) = E[E (Ai Aj \t] T Z,r] T Z, |A 2 | < c) | |A 2 | < c] 

= ^(AiAfl^Z.T^Z) I |A 2 | < c]. 

The conditional expectation inside the brackets on the right-hand side can 
be decomposed as the sum of four terms: 

E(ZZ T \r) T Z, ifZ) - E(ZZ T \rj T Z, if Z) 
^ ' - E(ZZ T \r] T Z,r] T Z) + E(ZZ T \ri T Z,r] T Z). 

Since (Z,r] T Z) JLt] T Z, we have Z JL r] T Z\rj T Z by Lemma 2.1(c). Hence the 
first term becomes 

E(ZZ T \ifZ) = yav(Z\rj T Z) + E(Z\if Z)E(Z T \r] T Z). 

Let P = r](rj T r])~ l rj T be the orthogonal projection onto Sy\z an d let Q = I — 
P be the orthogonal projection onto (<Sy\z) ± - Then it can be shown by As- 
sumption 2.2 that vav(Z\r] T Z) = Q. Because Z has a spherical distribution, 
E{Z\r) T Z) = PZ, so that the first term in (5) reduces to Q + PZZ T P. By 
Lemma 2.1(b), the second term in (5) factorizes into — E(Z\rj T Z, rj T Z)E(Z T \rj T Z, t] T Z), 
which by Lemma 2.1(c) further reduces to — E{Z\r} T Z)E(Z T \r] T Z). Hence, 
again using sphericity of Z, the second term in (5) is —PZZ T P. By similar 
arguments the third and fourth terms in (5) are -PZZ T P and Q + PZZ T P, 
respectively. Therefore 

(6) E(A 1 Aj I I A 2 | < c) = 2Q + P#(AiAf | |A 2 | < c)P. 

Let v be a vector in (5y|^)^, and multiply this matrix by v T from the left 
and by v from the right to obtain vai(v T A\ | |A 2 | < c) = 2. This completes 
the proof. □ 

Note that without Assumption 2.1 the theorem still holds to an extent, in 
the sense that the eigenvectors of E(AiAj \ |A 2 | < c) orthogonal to span(?y) 
still have eigenvalues equal to 2. However, without this assumption exhaus- 
tiveness would be lost, because we cannot rule out the possibility that eigen- 
values other than these may also be 2. 

This eigenvalue structure is similar to that of SAVE. For some c > and 
y, let S(c) be the sliced averaged variance 5(c) = var(A | \Y — y\ < c). In 
this notation I p — S(c) is the SAVE matrix for a slice centered at y. Under 
ellipticity and Assumption 2.2, we have 

S(c) = Q + PS(c)P 



CONTOUR REGRESSION 



13 



(see [7]). Thus the eigenvalues corresponding to the eigenvectors of S(c) that 
are orthogonal to the central subspace are identically 1. However, there are 
important differences between contour regression methods and SAVE; these 
will be briefly discussed at the end of Section 4.3. 

3. Sufficient conditions for exhaustive estimation. In order to place the 
theory of simple contour regression on a firmer foundation, we devote this 
section to deriving a sufficient condition for Assumption 2.1. As shown in the 
previous sections, if this assumption holds, then SCR provides y/n- exhaustive 
estimation of the central subspace Sy\x] that is, the estimating vectors con- 
verge with y^-rate to a set of vectors that span S Y \x m its entirety. Suffi- 
cient conditions of this type are extremely elusive; to our knowledge none has 
been established with reasonable generality for other -^/n-consistent methods 
such as OLS, PHD, SIR or SAVE. Results from an early, prescient paper 
by Peters, Redner and Decell [19] lead to exhaustiveness of SAVE under 
the condition that X\Y is multivariate normal. However, this condition is 
very restrictive — note that even in a typical location regression of the form 
Y = f(X) + e with X and e independent and both normally distributed, 
this assumption is not met unless /(•) is linear. Because of its generality, 
the sufficient condition given here for SCR is the first of its kind. 

We will need the notion of stochastic ordering. Let S and T be two random 
variables. We say that S is stochastically less than or equal to T if, for 
any real number r, Pr(S* < r) > Pr(T < r), and write this as S <d T. If, in 
addition, the inequality is strict on a subset of the real line with positive 
Lebesgue measure, we say that S is stochastically (strictly) less than T and 
write S <d T. The following lemma is obvious, and its proof will be omitted. 

Lemma 3.1. Suppose that S and T are random variables taking values 
in a common set Q C K, and that S <d T . Then: 

(a) E(S)<E(T). 

(b) Given a monotone real-valued function g:Q, \— > R, g(S) <d g(T) if g(-) 
is increasing, and g(T) <dg(S) if g(-) is decreasing. 

As a special case, consider a pair of random variables (S, T). We will write 
(S\T = h) < d {S\T = t 2 ) if, for any r, Pr(S < r\T = h) > Pr(5 < r\T = t 2 ), 
with strict inequality held on a set with positive Lebesgue measure. 

The following lemma, which is proved in the Appendix, will also be used. 

Lemma 3.2. Let p(s) and q(s) be the densities of nonnegative random 
variables S and T taking values in a common support Q, C M. + , and suppose 
that p(s)/q(s) is decreasing in s. Then E(S) < E(T). 
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In developing a sufficient condition for Assumption 2.1, we restrict our- 
selves to a location structure, that is, to regressions of the kind 

(7) Y = f((3 T X) +ae, e±X,E(e) = 0. 

Ultimately, the sufficient condition will be imposed on the behavior of /(•). 
Let (X,e) be an independent copy of (X, e), A = X — X , T = e — e, and 
let Ft(-) be the cumulative distribution function of T. For the statement of 
the following theorem, it will be more informative to write f((3 T x) merely 
as g(x). 

Theorem 3.1. Suppose that X has an elliptically contoured distribution 
with E(X) = and var(X) = I p , and that Assumption 2.2 holds. Moreover, 
suppose that model (7) holds with the density /t(£) of Fx(t) being a de- 
creasing function of \t\. If for any a G Sy\x an d whenever < S\ < 62 we 
have 

(8) \g(X + A) - g(X)\ | {\a T A\ = 8{\ < d \g(X + A) - g(X)\ \ {\a T A\ = 5 2 }, 
then Assumption 2.1 holds for every c> 0. 

Before proving the theorem, let us comment on its significance. To under- 
stand the intuition behind condition (8), first consider the case where X is a 
scalar random variable. Intuitively, condition (8) should hold trivially if g is 
a monotone function, because it holds pointwise in X = x with <^ replaced 
by ordinary inequality < (see Example 3.1 below). However, condition (8) 
by no means restricts g(-) to being monotone, because being stochastically 
large or small is an average behavior for all values of X , and does not require 
being large or small for every single value X = x. It then does seem to make 
sense to assume that g(X + A) is collectively farther away from g(X) if A 
is larger: this is simply requiring g to be reasonably variable. In the multi- 
variate case, condition (8) requires this to hold along any direction a in the 
space Sy\x-> which is the space along which g(x) does vary. Also the require- 
ment that frif) decreases with \t\ is not a severe restriction, considering that 
this density is symmetric about by construction. Finally, the result can be 
generalized straightforwardly to nonstandardized elliptical predictors. Thus, 
Theorem 3.1 allows us to conclude that, for elliptical predictors with con- 
stant conditional variance along the central subspace (Assumption 2.2), all 
that is required to guarantee Assumption 2.1 — and hence exhaustiveness of 
SCR — are very mild conditions on the behavior of the mean function and 
the error term in (7) (some instances are provided after the proof). 

Proof of Theorem 3.1. Let a e S Y \x and £ £ {Sy\x) L - Theorem 2.3 
implies that var(£ T (X — X) \ \Y — Y\ <c) = 2, which is the same as the 
unconditional variance vai(a T (X — X)). Hence, it suffices to show that 

var(a T (X — X) \ \Y - Y\ < c) < var(a T (A > - X)). 
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Let U = \Y-Y\ and V = (a T (X-X)) 2 . We are then to show that E(V\U < 
c) < E(V). Let f v (-) be the density of V. Then 

f°° Pr(U<c\V = v) 
(9) E(V\U<c)=j Q v p r ^ c) } fv{v)dv. 

Now, let r{v) = Pr(C7 < c|V = v)/~Pr(U < c). Then r{v)fv{v) is itself a den- 
sity on M + . By Lemma 3.2, if we can show that r{v) is a decreasing function 
of v, then the right-hand side of (9) is smaller than / vfv(v) dv and the proof 
is complete. So let us show that Pr(C/ < c\V = v) decreases in v. Note that 

Pt(U < c\V = v)= E{Pr(U < c\X, X)\V = v}. 

Because (X, X) is independent of we can re-express the conditional 

probability Pr([7 < c\X,X) as 

PrfopO - - c < T < g(X) - g(X) + c) 

= F T ( 5 (X) - g(X) + c) - F T ( 5 (X) - - c). 

Because the roles of X and X can be exchanged, and because V = (a T (X — 
X)) 2 = (a T (X -X)) 2 , we have 

Pr([/ < C \V = v) = E[F T (g(X) - g(X) + c) - F T (g(X) - g(X) - c)\V = v] 
= E[F T (g(X) - g(X) + c) - F T (<7(X) - 5 (X) -c)\V = v\. 

So Pr(C/ < c|y = v) can be written as the average of the two expressions on 
the right-hand sides of the first and second equalities in the above display. 
That is, if we write g(X) — g(X) + c as A and g(X) — g(X) — c as —B, 
then Pr(C7 < c\V = v) can be written as E(F T (A) - F T {-B) + F T (B) - 
Ft{—A))/2. However, since Ft(-) is the cumulative distribution function of 
a symmetric density, we have Fr(t) — Fr(—t) = 2i<r(i) — 1 for any t. Hence 

Pv(U <c\V = v) 

= E[F T (g(X) - g(X) + c) + F T (g(X) - g(X) +c)\V = v]-l 

= E[F T (R + c) + F T (-R + c)\V = v]-l = E{G{R)\V = v)-l, 

where R = \g(X) - g{X)\. Thus it suffices to show that E(G(R)\V = v) 
decreases with v. Because /t(-) is symmetric about 0, 

G'{r) = f T (r + c) - f T (-r + c) = f T (\r + c|) - f T (\r - c\). 

But since r and c are both positive, r + c > |r — c\. It follows that the 
right-hand side is negative, and G(r) is strictly decreasing in r for r > 0. By 
Lemma 3.1(a), E(G(R)\V = v) will be decreasing in v if, for any V2 > v\ > 0, 



(10) 



(G{R)\V = v 1 )< d (G(R)\V = v 2 ). 
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However, because G(R) is a decreasing function of R, by Lemma 3.1(b) 
inequality (10) will hold if (R\V = v±) <d (R\ V = v 2 ). The latter inequality 
is equivalent to (8). □ 

To illustrate the generality of this sufficient condition we now verify it for 
some examples. Let us consider the following specialization of our location 
structure. In (7) take X ~ iV(0, I 2 ) and [3 = (0,1) T , so that Y = f(X 2 ) + ere. 
Consider the conditional probability 

Pr(|/(X 2 ) - f(X 2 )\ <r\\X 2 -X 2 \ = 5). 

Condition (8) will be satisfied if, for each r > 0, this quantity decreases in 5. 
Because the distribution of X 2 — X 2 is symmetric about zero, this probability 
is 

Pr(|/(X 2 ) - f(X 2 )\ < r\X 2 -X 2 = <5)/2 

-f Pr(|/(X 2 ) - f(X 2 )\ <r\X 2 -X 2 = -S)/2. 

Because the roles of X and X can be interchanged, the conditioning argu- 
ment X 2 — X 2 = —5 in the second term can be replaced by X 2 — X 2 = ft and 
hence 

Pr(|/(X 2 ) - f(X 2 )\ <r\ \X 2 -X 2 \=6) 

= Pr(|/(X 2 ) - f(X 2 )\ < r\X 2 -X 2 = 6) = *(5). 

Thus (8) will hold if, for each r > 0, ^(S) is a decreasing function of 5 > 0. 
Now X 2 and X 2 can be written as S + T and S — T where S = {X 2 + 
X 2 )/2 and T = (X 2 — X 2 )/2. Note that, by normality of X, S and T are 
independent. Hence, 

(5) = Pr(|/(5 + T) — f(S — T)\ <r\T = 5) = Pr(|/(5 + 5)-f(S-5)\< r). 

So for this specialization of our location structure we only need to verify 

\f(S + ft) - f(S - St)\ < d 1/(5 + S 2 ) - f(S - 5 2 )\ 

for S ~ ^(0,1/2) and for any < 5\ < 5 2 . The following examples both 
reduce to verifying this inequality. 

Example 3.1. Suppose that f(x 2 ) is a continuous and monotone func- 
tion which without loss of generality can be assumed to be monotone increas- 
ing. Then for any ft < ft, \f(s + ft) - f(s - ft)| < d \f(s + ft) - f{s - ft)|. 
To see that the strict inequality (<d) holds, let r be a real number in the 
set {/(s + ft) — f(s — ft) : s G 1R}. By continuity there is an sq such that 
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f(so + S{)- f(a -5{) =r <f(s + S 2 )- f(s -S 2 ). Hence, in the neighbor- 
hood of (s -r,s ), f(s + Si) -f(s-8i) <r < f(s + 5 2 )- f{s-5 2 ). Writing 
|/(5 + S) - f(S - 5)| as R(5), we have 

Pr(-R(<5i) < r) = Pr(i?(<5i) < r, i2(<5 2 ) < r) + Pr(i?(<5i) < r, R(S 2 ) > r) 

= Pv(R(5 2 ) <r) + Pr(i2(5i) < r, fl(5 2 ) > r). 

Because the set {s : R(S±) < r, R(5 2 ) > r} contains an open interval, Pr(i?(#i) < 
r,R(5 2 ) > r) > 0, and consequently Pr(i?(Ji) < r) < Pr(R(5 2 ) < r). Because 
Pr(i?((5i) < r) — Pr(i?(<5 2 ) < r) is continuous in r, this inequality holds in an 
open interval around r, which has positive Lebegue measure. 

Example 3.2. Let f(x 2 ) = (x 2 — a) 2 . Example 2.1 is a special case of 
this regression with a = and e ~ -/V(0, cr 2 ). In this case <5) = 4|s — a\6. 
Hence for < S\ < S 2 , R(s;8i) < R(s;5 2 ) for all s. By an argument similar 
to the one in Example 3.1, it is easy to see that R(S;5i) <d R(S;8 2 ). 

4. General contour regression. 

4.1. Estimation. The idea underlying SCR is to use the inequality \Y — 
Y\ < c to identify vectors aligned with the contour directions. However, this 
inequality also picks up other directions when the regression function is 
nonmonotone. Under ellipticity such directions are averaged out, so that 
the method remains -^/n-exhaustive. Nevertheless, these "wrong" directions 
do tend to decrease efficiency by blurring up the "right" ones. In other 
words, the inequality \Y — Y \ < c is not a very sensitive contour identifier 
for nonmonotone functions — even though it is sufficiently sensitive to main- 
tain -y/n-exhaustiveness. We now illustrate this point using the regression in 
Example 2.1. 

To construct the left panel of Figure 1, we generated twenty observa- 
tions (Xi, Yi), i = 1, . . . , 20, according to the regression in Example 2.1, with 
a = 0.3. We then used the threshold value c = 0.5, connecting by a solid 
line segment any two points Xi,Xj G R 2 satisfying \Yi — Yj\ < 0.5. Roughly 
speaking (note that we have ignored the rescaling issue which has little bear- 
ing on this discussion), SCR picks up the contour directions by a principal 
component analysis of the vectors represented by these line segments. We see 
that, though most of the segments are horizontal (i.e., aligned with the true 
contour direction), there are a considerable number of segments pointing to 
arbitrary directions. This is because the response surface is [/-shaped and 
the inequality \Yi — Yj\ < 0.5 does not discriminate between the segments 
aligned with the contour and those across the [/-shaped surface that also 
have small increments in Y. Though the arbitrary directions tend to average 
out due to the ellipticity of the distribution of X, they make the picture less 
sharp and the method less efficient. 
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To overcome this drawback we replace the contour identifier \Y{ — Yj\ < c 
by a more sensitive one. Consider the variance of Y along the line through 
Xi and xj. Formally, let £(t;Xi,Xj) = (1 — t)xi + txj, t 6 M, be the straight 
line that goes through Xi and Xj, and define 

V(xi,Xj) = var(Y\X = £(t;xi,Xj) for some t). 

For a more concrete expression, let 5(xi,Xj) be the p x (j> — 1) matrix 
(Si,... ,6 p -x) whose columns form a basis in (xj — Xi)^. Then V(xi,Xj) can 
be re-expressed as 

(11) V(xi,Xj) =var(Y\5 T (xi,Xj)X = 5 T \xi,Xj)xi). 

We will aim at identifying contour vectors by the smallness of this condi- 
tional variance. 

The next task is to construct a sample estimate of V(Xi,Xj). We will 
denote the line t(-;Xi,Xj) by £(Xi,Xj). For any X k , let d(X k ,£(Xi,Xj)) be 
the Euclidean distance between X k and the line £(Xi,Xj); that is, 

d{X k ,£{X i ,X j ))=mm\\X k -£(t;X i ,X j )\\, 

where || • || stands for the Euclidean norm. Because \\X k — £(t; Xi, Xj)\\ 2 is a 
quadratic function of t, this minimum distance can be expressed explicitly 
as 



d{X k ,l{X it Xj)) 



\Xh — Xi 



|2 {(Xk-XifjXj-Xi)} 2 ] 1 / 2 



\Xj — Xi 



For any p > 0, we define the tube of radius p connecting Xi and Xj to be 
the set 

CM = {X k : d{X k ,t{XM) <p,k = l,...,n}. 
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According to this definition, each tube contains at least two points in the 
sample. Next we estimate the variance of Y along these tubes. Let riij(p) be 
the number of points in the tube Cjj(p), and let 

V(X i ,X j ;p) = ^ T - (Yk-Mp)) 2 , 
ni ^ p > x k ec tj (p) 

where F„ (p) = -J— ]T Y k . 

ni ^ p > x k e Cij ( P ) 

We can now identify the contour directions by the smallness of V(Xi, Xj] p). 

Plotted in the right panel of Figure 1 are the same sample points as in 
the left panel, but with the line segments picked up by V(Xi,Xj; p) < c, 
where c = 0.5 and p = 0.3. We can see that many of the segments pointing 
to random directions in the left panel have been removed. To get a quanti- 
tative comparison, we calculated the first principal component for the line 
segments in each panel, which equals (0.9169, 0.3991) T for the left panel 
and (0.9991, — 0.0417) T for the right panel. The latter is much closer to the 
direction (1,0) T , the population vector orthogonal to Sy\x- 

We now construct the estimator of Sy\x- Along lines similar to those 

followed in Section 2, we standardize the predictor observations to Zi = 
E _1 ' 2 (Xj — p), and form the matrix 

(12) F(c) = -±r £ (Z 3 - Z^Zj - Z t fl(V(Z u Z r ,p) <c), 
^2/ (i,j)eN 

where N is the same index set as used in (4). The matrix F(c) takes the place 
of S~ 1 / 2 ff(c)S~ 1 / 2 for the simple contour regression. As in SCR, we take 
the spectral decomposition of F(c), and use 7 p+(? _i, . . . ,7 P , the eigenvectors 
corresponding to the smallest q eigenvalues, to form 

Sy\x = span(S~ 1/2 7 p ^ + i, . . . , S" 1/2 7 P ). 

Regarding the choice of c, comments similar to those made at the end of 
Section 2.3 apply here. In particular, as a rule of thumb we propose to use 
5% to 15% of the (2) empirical directions. 

4.2. Population-level exhaustiveness. Assume that X is already stan- 
dardized to E{X) = and var(X) = I p (so Z is X itself). The population 
version of the matrix F(c) in (12) is 

F(c) = E[{X- X)(X - X) T I(V(X,X) < c)}, 

which is proportional to the matrix 

G(c) = E[(X - X){X - X) T \V{X,X) < c}. 
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Here we will demonstrate that, for sufficiently small c, the eigenvectors corre- 
sponding to the smallest q eigenvalues of G(c) span S Y \x- F° r this purpose 
we introduce an assumption that parallels Assumption 2.1. Again (X,Y) 
indicates an independent copy of (A, Y). 

Assumption 4.1. For any choice of vectors v £ S Y \x an d w G (Sy\x) ± 
such that ||u|| = \\w\\ = 1, and some constant c> 0, we have 

(13) vai[w T (X- X)\V(X,X) < c] >vav[v T (X- X)\V{X,X) <c}. 

The interpretation of this assumption is similar to that of Assumption 
2.1, except that V(X, X) replaces \Y — Y\ as the measure of variation of Y 
along the line through X and X. We now deduce population exhaustiveness 
under this assumption. Once again we do so for a spherical predictor without 
loss of generality. 

Theorem 4.1. Suppose that X has an elliptical distribution with E(X) = 
and var(A) = I p . Then, under Assumption 4.1, the eigenvectors of G{c) cor- 
responding to its smallest q eigenvalues span the central subspace Sy\x- 

The proof of this theorem is similar to that of Theorem 2.1 and will be 
given in the Appendix. The generalization to an arbitrary elliptical distri- 
bution for X is similar to Corollary 2.1 and will be omitted. 

We expect -^/n-consistency to hold also for GCR estimation. However, the 
asymptotic analysis is substantially more complex than for the SCR case, 
because the estimator cannot be rendered directly as a [/-statistic. Alter- 
native techniques must be developed for such an analysis. In this paper we 
will not prove -^/n-convergence rate for GCR, but will back up our claim by 
simulation in Section 6. 

4.3. Sufficient conditions for exhaustive estimation. Next, following a 
reasoning similar to that in Section 3, and again in reference to the location 
structure in (7), we derive a sufficient condition for Assumption 4.1. 

Note that, since Sy\x = span(/3), for a p x r matrix 5 (r <p) we will have 

(14) var(/(/3 T X)|<5 T X) >0 

unless span(/3) C span(<5); that is, f{0 T X) is not a function of 5 T X unless 
5 spans a space containing the central subspace. 

Theorem 4.2. Suppose that X has an elliptically contoured distribution 
with E(X) = and var(A) = I p , and that model (7) holds. Then Assump- 
tion 4.1 is satisfied for all sufficiently small c> for which {(x, x) : V(x, x) < 
c} is a nonempty set. 
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Proof. We first show that V(x, x) > a 2 for all x and x, and that equality 
holds for some x and x, so that, whenever c > a 2 , {(x,x) :V(x,x) < c} is 
nonempty. Let 5 be any p x r matrix with r <p. Because e JL ((3 T X, S T X), 
we have by Lemma 2.1(c) /3 T X JLe\5 T X. Hence 

, * var(y|<5 T X = t) =var(/(/J T X)|5 T X = t) +var(e|5 T X = t) 
1 j = var(/(/? T X)|5 T X = t)+cr 2 , 

where for the second equality we have used the independence between e and X 
Now take 5 = 5(x, x) and t = 5 T (x, x)x, where 5(x, x) is as defined above dis- 
play (11). We see that V(x,x) > a 2 , and that equality holds whenever (x — x) 
is orthogonal to span(/3) = Sy\x- With this in mind the assertion of the the- 
orem can be rewritten as: if v € Sy\x an d w € (Sy\x)\ then for sufficiently 
small t > 

vav[w T {X - X)\V(X, X)<cj 2 + t]> var[v T (X - X)\V(X,X) < a 2 + r]. 
By the definition of conditional expectation, 

\unvax[(X - X)\V(X,X) <<t 2 + t] = var[(X - X)\V{X, X) = a 2 ]. 

tJ.0 

Hence if we can show that 

(16) vav[v T {X - X)\V{X,X) = a 2 } < var[w T (X - X)\V (X , X) =a 2 }, 

then the inequality will hold for all sufficiently small r > 0, proving the 
theorem. 

To prove (16), note that (15) also implies that var(Y\5 T X = t) = a 2 if and 
only if var(/ '{0 1 X)\5 T X = t) = 0. However, because of (14), this will happen 
if and only if span(/3) C span(J). Taking 5 = 5(x, x) and t = S T (x, x)x, we see 
that V(x,x) = a 2 if and only if span(/3) C span(5(x,x)), which is equivalent 
to (x — x) _L span(/3). Hence the left-hand side of (16) equals 0. 

It remains to show that the right-hand side of (16) is positive. First, note 
that the roles of X and X are exchangeable, and hence 

E[{X - X)\V{X,X) =a 2 } =E[{X - X)\V{X,X) = a 2 ]. 

However, by the definition of V(x,x), V(x,x) = V(x,x) for all x and x, and 
hence 

E[(X - X)\V(X, X) = a 2 ] = E[(X - X)\V(X,X) = a 2 }. 

It follows that both sides of this equation must be 0, and so the right-hand 
side of (16) reduces to E[(w T (X - X)) 2 \V(X, X) = a 2 ). If this quantity were 
0, then w T (x — x) = whenever V(x, x) = a 2 , which holds whenever (3 T (x — 
x) = 0. Because the support of X is spherical, (x — x) can run through every 
direction in W. Hence span(/?)- L C span(w)- 1 -, or equivalently w S span(/?), 
which is a contradiction. □ 
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The conditions in Theorem 4.2 are much weaker than those in Theo- 
rem 3.1. Predictor ellipticity and the structure in (7) are postulated in both 
cases. However, constant conditional variance (Assumption 2.2) is not re- 
quired in Theorem 4.2, and essentially no requirement is posed on the be- 
havior of the mean function and the error term in (7). Thus, GCR will be 
exhaustive under settings even more general than those required by SCR. 
Intuitively, this is because V(X,X) possesses stronger discriminating power 
than \Y — Y\: it can identify the contour vectors of any function f(f3 T X), as 
long as the latter genuinely depends on all the components of [i T X [which is 
the case because of the minimality of the central subspace Sy\x = span(/3) 
discussed in Section 1]. 

As mentioned at the end of Section 2.4, we now briefly discuss differences 
between contour regression methods and SAVE. First, in Section 3 and this 
section we have shown that SCR and GCR are guaranteed to be exhaus- 
tive under population-level conditions much milder than the ones assumed 
for exhaustiveness of SAVE. Second, contour regression methods break the 
barriers of slices, making more efficient use of data. In comparison, slice- 
based methods such as SAVE cannot exploit interslice information. Third, 
whereas the construction of SCR is somewhat similar to that of SAVE, and 
we suspect the gain in accuracy of SCR (which will be demonstrated by 
simulation) to be largely due to its efficient use of interslice information, 
GCR differs more intrinsically from slice-based methods: it employs a more 
sensitive contour identifier, and is thereby capable of picking up directions 
not easily detected by SAVE or SCR when the regression surface is complex. 
In Section 6 we show by simulation how this leads to improved accuracy in 
estimating the central subspace. 

5. Robustness against nonellipticity. The population exhaustiveness of 
our contour-based methodology relies on ellipticity of the predictor distri- 
bution. This is because in the theoretical development we have treated the 
constant c in (4) and (12) as fixed with respect to the sample size n. Ellip- 
ticity of the distribution of X helps to balance out the effect of line segments 
that are not aligned with the contour directions. As mentioned in the Intro- 
duction, ellipticity requirements are ubiquitous for global methods such as 
OLS, SIR, PHD and SAVE. They are adopted to guarantee linear relation- 
ships among predictors, which in turn are needed for the methods to estimate 
directions within the central subspace. When the number of predictors p is 
relatively small, diagnosing and remedying departures from ellipticity is rel- 
atively straightforward — in practice, scatterplot matrices are used to search 
for marked curvatures, and predictor transformations or data reweighting 
to mitigate such curvatures [4, 6]. However, especially when p is large, di- 
agnosing and remedying departures from ellipticity becomes laborious and 
complicated. 
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Notwithstanding the theoretical requirement, contour regression methods 
(especially GCR, whose contour identifier is more sensitive) can perform well 
even under violations of ellipticity. In Section 6 we will address this robust- 
ness by simulation; here we motivate it from a theoretical viewpoint. We will 
show that, postulating again the location structure in (7), the eigenvectors 
corresponding to the smallest p — q eigenvalues of the matrix 

A = E[(X -X)(X- X) T \V(X, X) = a 2 ] 

span the orthogonal complement of the central subspace, (Sy\x) \ even 
when X is not elliptical. This suggests that if we let c decrease to a 2 as n 
increases, then the eigenvectors corresponding to the smallest p — q eigenval- 
ues of F(c) in (12) (after appropriate transformation by £ -1 / 2 ) will tend to 
recover the whole Sy\x> regardless of the shape of the distribution of X . In 
practice, if we make c small [i.e., close to the smallest value of V(Zi,Zj\p) 
in (12)], then GCR is likely to estimate the central subspace exhaustively 
and effectively even if the shape of X does not help the process by averaging 
out erroneous directions, as is the case under ellipticity. 

Theorem 5.1. Suppose that model (7) holds and that X is a continuous 
random vector with an open support X CW. Then the matrix A has exactly 
p — q zero eigenvalues, and their corresponding eigenvectors span (Sy\x)~ L - 
In symbols, 

ker(A) =S Y \x, 
where ker(A) = {h G W : Ah = 0} is the kernel of A. 

Proof. Note that {X — X) is orthogonal to span(/3) = Sy\x if an d only 
if span(/3) C span(5(X, X)), which, by the argument following (15), happens 
if and only if V(X, X)=a 2 . Thus, conditioning on V(X, X) = a 2 , (X - X) 
is orthogonal to span(/3). It follows that, whenever h belongs to span(/?), 
Ah = 0, and thus span(/3) C ker(yl). 

Conversely, suppose h belongs to ker(A). Then 

h T Ah = E[(h T (X - X)) 2 \V(X,X) = a 2 } = 0. 

Thus, whenever V(X, X) = a 2 , h is orthogonal to (X — X). Equivalently, 
whenever (X — X) is orthogonal to span(/?), (X — X) is orthogonal to h. In 
other words, if we let X* = {x — x:x £ X,x £ X}, then 

X* n (span(/3)) ± C X* n (span(/i)) ± . 

However, because X is an open set, X* is an open set containing 0. By 
Lemma A.l in the Appendix (span(/3))- L C (span(/i))- L , or equivalently h G 
span(/3), as desired. □ 
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Intuitively, the theorem shows that the only directions (x — x) along which 
the variance V(x,x) achieves its minimum are those aligned with the con- 
tour. This is largely due to conditioning on the conditional variance, a pop- 
ulation quantity. An analogous result cannot be derived for SCR. 

Theorem 5.1 also suggests that, when we are not confident about the 
ellipticity of the distribution of X, we should use a stricter thresholding in 
the analysis [i.e., choose a small value of c, or include a small proportion of 
the (2) empirical directions] . This makes contour regression estimators more 
similar to kernel estimators, whose consistency depends on the kernel width 
approaching as n — > 00. We will return to this point in the Conclusions. 

6. Simulation results. We now compare the performance of both ver- 
sions of contour regression, SCR and GCR, with that of well-known dimen- 
sion reduction methods ensuring -^/n-consistency, such as OLS, SIR, PHD 
and SAVE. For such comparisons, we need to introduce a measure of dis- 
tance between two subspaces of W. Let S\ and S2 be two q-dimensional 
subspaces of UP and let Pg 1 , Ps 2 be the orthogonal projections onto S\ and 
£2, respectively. We use the distance 

dist(«Si,5 2 ) = ||P 5l -P5 2 ||, 

where || • || is the Euclidean norm, that is, the maximum singular value of a 
matrix. 

In the following, we present five examples covering a range of possible re- 
gression contexts. For both SCR and GCR we need to determine the number 
of empirical directions to include in the principal component analysis, and 
for GCR we also need to determine the tube radius p. Though in this paper 
we will not deal with the optimal choice of these numbers, related issues will 
be discussed to some extent in the examples. 

In the first three examples the sample size n and dimension p are relatively 
small, whereas in the last two examples they are much larger. Instead of 
using a fixed c for thresholding, we fix the proportion r of the number of 
empirical directions with smallest variation (absolute response differences 
for SCR or tube variances for GCR) relative to Q), the total number of 
empirical directions. For the first three examples we use r = 6qn/(^) for 
SCR and r = 2qn/(^) for GCR, and use p = 1 for GCR. For the last two 
examples we use r = 5% for both SCR and GCR and p = 2 for GCR. 

Example 6.1. Consider the regression 
(17) Y = Xf+X 2 + ae, 

where X ~ N(0, ^4), so that predictor ellipticity holds, e ~ N(0, 1) and e _L 
_L X . Here the central subspace is of dimension q = 2 and is spanned by the 



CONTOUR REGRESSION 



25 



vectors (1,0,0,0) T and (0,1,0,0) T . We compare SCR and GCR with SIR, 
SAVE and PHD using three different values of the error standard deviations: 
a = 0.1, 0.4 and 0.8. Because OLS can pick up at most one direction, it is 
not included in this comparison. For each value of a we draw 500 samples of 
size n = 100, and on each sample we apply the five methods to produce five 
estimates of Sy\x- Next we compute the distance between these estimates 
and the true central subspace according to the definition at the beginning 
of this section. Finally, we compute an average and a standard error from 
the resulting 500 distances, for each a value and estimation method. Results 
are presented in Table 1 (DIST and SE columns correspond to average and 
standard error of the distances, resp.). 

The numbers in Table 1 indicate that both SCR and GCR outperform 
SIR, SAVE and PHD in this example. Intuitively this is because SIR does not 
perform well when there is no linear trend, thus failing to pick up the second 
direction (0, 1,0, 0) T , whereas PHD, and to a lesser extent SAVE, do not 
perform well when there is no quadratic trend, thus failing to give accurate 
estimates of the first direction (1,0,0,0) T . In contrast, both SCR and GCR, 
as also demonstrated theoretically, provide comprehensive estimates of the 
central subspace. Note that SAVE performs better than SIR and PHD — by 
inspecting a few typical cases (results not presented) we find that SAVE 
does a better job at picking up the linear trend than PHD. Nevertheless, it 
is much less accurate than SCR and GCR. From the table we can also see 
that GCR generally outperforms SCR. 

In the next example predictor ellipticity is maintained, but the comparison 
is based on a more complex regression surface in which linear and quadratic 
trends are not neatly separated along two coordinate directions. In this more 
complicated case, SIR, SAVE and PHD can also detect both directions. 

Example 6.2. Consider the regression 

Y = AV(0.5 + (X 2 + 1.5) 2 ) + (1 + X 2 ) 2 + ae, 

Table 1 

Comparison of SCR, GCR and other methods for Example 6.1 



SCR GCR SIR SAVE PHD 



a 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


0.1 


0.23 


0.11 


0.16 


0.07 


0.78 


0.24 


0.43 


0.25 


0.80 


0.21 


0.4 


0.25 


0.11 


0.20 


0.08 


0.79 


0.23 


0.54 


0.27 


0.79 


0.21 


0.8 


0.31 


0.13 


0.32 


0.16 


0.80 


0.23 


0.73 


0.25 


0.79 


0.21 
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Table 2 

Comparison of SCR, GCR and other methods for Example 6.2 



a 


SCR 


GCR 


SIR 


SAVE 


PHD 




DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


0.1 


0.44 


0.25 


0.28 


0.15 


0.39 


0.21 


0.61 


0.26 


0.71 


0.25 


0.4 


0.47 


0.25 


0.33 


0.18 


0.40 


0.21 


0.65 


0.26 


0.70 


0.25 


0.8 


0.54 


0.26 


0.45 


0.25 


0.49 


0.24 


0.73 


0.24 


0.73 


0.24 



where X and e are as denned in Example 6.1. Here, again, q = 2 and the 
central subspace is spanned by the vectors (1,0,0,0) T and (0,1,0,0) T . We 
explore again the same grid of values for a, using the same number of samples 
and sample size as in Example 6.1. Results are presented in Table 2. 

We see that there is still a substantial improvement by GCR over SIR, 
SAVE and PHD. SIR slightly outperforms SCR, but the latter is much more 
accurate than SAVE and PHD. 

In Section 5 we provided a population-level argument for the robustness 
of GCR against nonellipticity of the distribution of X. In the next example 
we compare GCR with OLS, PHD, SIR and SAVE when the distribution of 
X is not elliptical. 

Example 6.3. Consider the regression 

V = sin 2 (vrV2 + 1) + ere, 

with predictor I"gK 4 uniformly distributed on the set 

[0, l] 4 \ {x G M 4 : x t < 0.7, i = 1, 2, 3, 4}, 

which defines a four-dimensional cube with a corner removed (this expedient 
is used to create an obvious asymmetry in the predictor distribution). We 
take again e ~ N(0, 1) and e JL X. Here, the central subspace is of dimension 
q = 1 and is spanned by the vector (0, 1,0,0) T . We perform the comparison 
once again drawing 500 samples of size n = 100 for each value a = 0.1, 0.2 
and 0.3. Results are presented in Table 3. 

We see that GCR achieves a substantial improvement over OLS and PHD, 
and a modest one over SIR and SAVE. It also appears that SIR and SAVE 
are more robust than OLS and PHD against departures from ellipticity of 
X. 

Next we compare contour regression methods with existing methods on 
instances where both the predictor dimension p and the sample size n are 
much larger than in the previous examples. 
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Example 6.4. We consider two cases with predictor dimension p = 10. 
We start with the regression 

(18) Y = cos(3Xi/2) +Xl/2 + ae, 

where X = (X u . . . , X W ) T ~ N(0, I w ), e±X and e~AT(0,l). The cen- 
tral subspace has dimension q = 2 and is spanned by (1,0, ...,0) T and 
(0, 1, . . . , 0) T . As in Example 6.1, OLS is not considered in the compari- 
son, as it can only detect one direction. The error standard deviation a is 
fixed at 0.1,0.4 and 0.8 as in Examples 6.1 and 6.2, and for each such value 
we draw again 500 samples. Because of the increased dimension we now use 
a larger sample size; n = 500. The coefficients of the two terms cos(3Xi/2) 
and Xf are chosen so that the "signal" for Xf is strong for all a values, while 
the "signal" for cos(3Xi/2) is relatively weak for a = 0.8. In this fashion, 
we can gather a sense of how the form of the regression function affects the 
performance of the various methods. 

For both SCR and GCR we use r = 5% of the 500 x 499/2 = 124,750 
empirical directions (Xj — Xj) with the smallest \Yi — Yj\ or V(Zi, Zj; p). For 
GCR the tube size is taken to be p = 2. Results are presented in Table 4. 

For cj = 0.1,0.4, both SCR and GCR, and especially GCR, achieve a 
marked improvement over the other methods. For a = 0.8, though GCR 
still achieves some improvement, the accuracy of SCR is comparable to that 
of other estimators, indicating that under this level of noise the signal of 



Table 3 

Comparison of GCR and other methods for Example 6.3 





GCR 


OLS 


PHD 


SIR 


SAVE 


a 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST SE 


0.1 


0.10 


0.05 


0.17 


0.07 


0.24 


0.10 


0.13 


0.06 


0.14 0.08 


0.2 


0.12 


0.06 


0.19 


0.09 


0.29 


0.12 


0.18 


0.08 


0.22 0.12 


0.3 


0.20 


0.14 


0.22 


0.10 


0.36 


0.16 


0.22 


0.10 


0.34 0.20 












Table 4 










Comparison 


of SCR, GCR and other methods for Example 


6.4, 


regression (18) 




SCR 


GCR 


PHD 


SIR 


SAVE 


a 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST SE 


0.1 


0.41 


0.12 


0.35 


0.07 


1.02 


0.20 


1.27 


0.18 


0.45 0.12 


0.4 


0.63 


0.22 


0.45 


0.11 


1.04 


0.21 


1.28 


0.18 


0.80 0.29 


0.8 


1.04 


0.27 


0.85 


0.20 


1.07 


0.22 


1.31 


0.14 


1.35 0.17 
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cos(3Ai/2) has dropped below the level detectable by most methods. We 
also observe that with sample size n = 500 the accuracy of SAVE has sig- 
nificantly increased compared with the previous examples where n = 100, 
suggesting that the relatively low accuracy of SAVE in the previous ex- 
amples is probably due to its efficiency rather than the lack of population 
exhaustiveness . 

The choice of p = 2 (compared to 1 used in the previous examples) is linked 
to the increased dimensionality: for large p observations become sparse, and a 
thicker tube is needed to capture enough points. Experiments with numerous 
regression specifications invariably indicate that GCR with a relatively large 
tube size achieves outstanding improvements in accuracy. To benchmark 
the effect of p, the following simple quantity is useful: Let X\, A2 and 
A3 be three independent observations from an A(0,Iio), and consider the 
probability of A3 falling within the tube through X\ and A2 with p = 2. 
This probability is easily computed by simulation. From 500,000 simulated 
replicates of X\ , A2 , A3 we find that 

Pr(d(A 3 ,£(Ai, A 2 )) < 2) « 2.37%. 

Thus for sample size n = 500 there are on average 2 + 11.85 ~ 14 obser- 
vations in each tube. In contrast, if we take p = 1 the probability falls to 
approximately 9.4 x 10 -5 , and the expected number of observations in each 
tube is 2.047, essentially equivalent to SCR. 

Next, to confirm these results, we consider again the simpler regression 
in (17) and keep all the specifications of Example 6.1, except for taking 
A ~ A(0, Jio), n = 500, r = 5% and p = 2. Results are presented in Table 5. 

We see that the broad patterns in Table 4 are confirmed, but the im- 
provement by SCR and GCR appears to be more significant. The accuracy 
of SCR compared with GCR is increased somewhat, probably due to the 
simpler form of the regression. 

In the next example we compare the performance of the above methods 
when the structural directions are determined by the variance, rather than 

Table 5 

Comparison of SCR, GCR and other methods for Example 6.4, regression (17) 



SCR GCR PHD SIR SAVE 



a 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


0.1 


0.34 


0.07 


0.31 


0.06 


1.34 


0.12 


1.41 


0.04 


0.47 


0.23 


0.4 


0.36 


0.07 


0.36 


0.07 


1.35 


0.11 


1.41 


0.04 


0.80 


0.36 


0.8 


0.44 


0.09 


0.49 


0.10 


1.34 


0.11 


1.41 


0.04 


1.35 


0.13 
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Table 6 

Comparison of SCR, GCR and other methods for Example 6.5 



a 


SCR 


GCR 


PHD 


SIR 


SAVE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST 


SE 


DIST SE 


0.0 


1.34 


0.12 


1.34 


0.11 


1.52 


0.13 


1.63 


0.19 


1.35 0.14 


0.5 


1.36 


0.10 


1.34 


0.11 


1.55 


0.15 


1.37 


0.11 


1.38 0.11 


1.0 


1.35 


0.11 


1.34 


0.11 


1.59 


0.14 


1.35 


0.11 


1.44 0.14 



the mean, function. This example demonstrates that, although the location 
structure (7) was postulated in deriving sufficient conditions for exhaustive 
estimation and robustness against nonellipticity, contour regression methods 
can be very effective also for regressions that are not based on location. 

Example 6.5. We consider the regression 

Y = \{X 1 -afe, 

where X ~ -/V(0, Iio), £ ~ N(0, 1) and e IL X. Here the central subspace has 
dimension q = l and is spanned by (1,0,...,0) T . The variance of Y is a 
quadratic function of X\ centered at a, which is fixed at a = 0,0.5, 1. Once 
again we generate 500 samples of size n = 500, and use r = 5% for SCR 
and GCR and p = 2 for GCR. Table 6 contains results for the comparison of 
SCR, GCR, PHD, SIR and SAVE. Cook and Li [5] proved that both OLS and 
PHD operate within the central mean subspace, and are therefore incapable 
of estimating a direction that only appears in the variance function. We 
include PHD in the comparison to serve as a benchmark for our subspace 
distance statistics. 

Table 6 shows that contour regression methods are indeed capable of 
estimating the variance function direction, because their accuracy is much 
higher than the benchmark accuracy of PHD. Overall, the accuracy of SCR 
and GCR is similar to that of SIR and SAVE. We also observe that when a 
is small, SAVE is more accurate than SIR, and the opposite is true when a is 
large. The accuracy of contour methods does not appear to depend markedly 
on a. It is also worth mentioning that the errors in Table 6 are significantly 
larger than those in the previous examples regardless of the method used. 
This simply reflects the fact that estimating variance structures is more 
difficult than estimating mean structures. 

Finally, we come to the issue of estimating the structural dimension q. As 
mentioned in Section 2.4, we believe that an asymptotic test for SCR can be 
developed along the same lines employed by existing methods such as SIR 
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and PHD, though this will require work beyond the scope of the present 
paper. The development of an asymptotic test for GCR would hinge on an 
asymptotic analysis of the GCR estimator, which has not been pursued here. 
Nevertheless, we can empirically assess the capability of SCR and GCR to 
estimate q by examining how much the eigenvalues corresponding to the 
central subspace are separated from those corresponding to its complement 
(i.e., the contour space). 
For SCR we use the matrix 

(19) 2/ p -5r 1 / 2 ^(c)$r 1/2 . 

This is the sample version of the population matrix 2I P — Y,~ 1 / 2 K(c)T,~ 1 / 2 . 
From Theorem 2.3 we know that the eigenvalues of the population matrix 
corresponding to contour directions are identically and those corresponding 
to the central subspace are strictly positive. Thus we expect the eigenval- 
ues of the sample matrix (19) to behave similarly. We consider again the 
simulations for regressions (18) and (17) in Example 6.4, with a = 0.4. We 
compute the ten eigenvalues of the matrix (19) for each of the 500 sam- 
ples, say \j> i, . . . , A^ io- From each of the ten sets of simulated eigenvalues 
{Aij, . . . , A5oo,j}i J = 1, - . - , 10, we then compute an average X.j and a stan- 
dard error fj. These numbers are shown in the two SCR columns of Table 7. 
For GCR we use the matrix 

G(c) = (Zi-ZjXZi-Zjf 




where the index set iV is as defined in (4). This matrix is proportional to 
F(c) in (12), rescaled so that it estimates E[(Z{ — Zj)(Zi — Zj) T \ |lj — < c] 
instead of E[(Zi - Zj)(Zi - Zj) T I{\Ii - Ij\ < c)]. Though we did not prove 
a theorem for GCR analogous to Theorem 2.3, we mimic the SCR case and 
compare the eigenvalues of 2I p — G(c) (it is the separation of eigenvalues 
that matters here). The simulation averages and standard errors of these 
eigenvalues over 500 samples for regressions (18) and (17) (with a = 0.4) are 
shown in the two GCR columns of Table 7. 

From Table 7 we see that for both methods, and in both regressions, 
the eigenvalues Agj and Aioj, which correspond to vectors in the central 
subspace, are significantly larger than the other eigenvalues. Furthermore, 
the contrast between Agj and Aioj and the remaining eigenvalues appears 
to be stronger for GCR than for SCR, suggesting that GCR is more sensitive 
in identifying the central subspace. 
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7. An application. We consider a data set concerning the effect on soil 
evaporation of various air and soil conditions such as temperature, humidity 
and wind speed ([12]; it is available in the Arc package — see http:/ /www.stat. 
umn.edu/arc/software.html). There are p= 10 predictors: average daily air 
temperature (Avat), area under the daily humidity curve (Avh), area un- 
der the daily soil temperature curve (Avst), maximum daily air tempera- 
ture (Maxat), maximum daily humidity (Maxh), maximum daily soil tem- 
perature (Maxst), minimum daily air temperature (Minat), minimum daily 
humidity (Minh), minimum daily soil temperature (Minst) and total wind 
speed in miles/hour (Wind). The response is daily soil evaporation (Evap). 
The data are collected over a period of 46 days, but do not show any obvi- 
ous serial dependence. Hence for simplicity we treat the data as independent 
replicates with n = 46. 

Figure 2 is the scatterplot matrix of the ten predictors, which does not 
seem to suggest serious departures from ellipticity. Furthermore, simultane- 
ous Box-Cox transformations of these predictors do not lead to significant 
improvements in ellipticity. Hence we use the untransformed predictors for 
our analysis. We apply both SIR and GCR to the data, using the negative 
Evap as Y. The two upper panels of Figure 3 are the scatterplots of Y versus 
the first two SIR directions, SIR1 and SIR2, on the standardized scale Z . 
The scatterplot for Y versus SIR1 (upper-left panel) shows a strong mono- 
tone trend which is almost linear. In contrast, the scatterplot of Y versus 
SIR2 (upper-right panel) does not show a detectable pattern. The two lower 

Table 7 

Averages (EVAL) and standard errors (SE) of eigenvalues from SCR and 

GCR 



MODEL I MODEL II 



EVAL(SE) 


SCR 


GCR 


SCR 


GCR 


A.i 


(ft) 


-0.26 (0.05) 


-0.55 (0.12) 


-0.23 (0.04) 


-0.48 (0.11) 


A.2 


(ft) 


-0.18 (0.04) 


-0.37 (0.10) 


-0.15 (0.04) 


-0.32 (0.09) 


A.3 


(ft) 


-0.11 (0.04) 


-0.23 (0.08) 


-0.09 (0.03) 


-0.21 (0.08) 


A.4 


(ft) 


-0.05 (0.04) 


-0.11 (0.08) 


-0.04 (0.03) 


-0.10 (0.07) 


A.5 


(ft) 


0.01 (0.04) 


0.00 (0.07) 


0.02 (0.04) 


0.00 (0.06) 


A.6 


(ft) 


0.07 (0.04) 


0.11 (0.07) 


0.07 (0.04) 


0.10 (0.07) 


A.7 


(ft) 


0.14 (0.04) 


0.23 (0.08) 


0.13 (0.04) 


0.20 (0.07) 


A.8 


(ft) 


0.23 (0.05) 


0.37 (0.08) 


0.21 (0.05) 


0.33 (0.07) 


A.9 


(ft) 


0.41 (0.08) 


0.91 (0.11) 


0.72 (0.07) 


1.08 (0.07) 


A. io 


(fto) 


1.17 (0.06) 


1.23 (0.07) 


1.14 (0.07) 


1.21 (0.07) 



MODEL I is regression (18) with a = 0.4, and MODEL II is regression (17) 
with a = 0.4 and a ten-dimensional predictor X. 
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panels of Figure 3 are the scatterplots of Y versus the first two GCR direc- 
tions, GCR1 and GCR2, on the standardized scale Z. The plot for Y versus 
GCR1 (lower-left panel) also shows a clear monotone, but slightly nonlin- 
ear, trend. What is interesting, however, is that the scatterplot of Y versus 
GCR2 (lower-right panel) suggests a U -shaped pattern. A three-dimensional 
spin plot of Y versus (GCR1, GCR2) shows a mean surface that, roughly 
speaking, is folded in the GCR2 direction and tilted upwards in the GCR1 
direction. In the Y-versus-GCR2 scatterplot, five points (labeled by "+") 
sit above the [/-shape near GCR2 = 0, appearing to weaken the [/-shaped 
pattern. However, these points are far out in the direction of GCR1 with 
high values of Y — corresponding to the five points labeled by "+" in the 
perspective scatterplot in Figure 4. 
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Fig. 2. Scatterplot matrix for the predictors in the soil evaporation data. 
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Because in this data set p = 10 and n = 46, as discussed in Section 6 
(following Example 6.4) we need to choose a rather large radius p to capture 
enough points in each tube. For this application of GCR we used p = 3.5 
(on the Z-scale) and 15% of the ( 4 2 6 ) = 1035 pairs (i.e., 155 pairs) of points 
among which V(Zi, Zy, p = 3.5) are the smallest. For SIR we used six slices 
defined so as to contain roughly the same number of points. 

Although without a formal testing procedure we cannot yet determine 
the statistical significance of GCR2, the 2D and 3D scatterplots from our 
GCR analysis do suggest that a second direction might be relevant in the 
evaporation data. Due to its small sample size relative to the dimension of 
the predictor, this example does not allow us to draw strong conclusions, but 
it demonstrates once again that GCR is more sensitive than classical meth- 



> 




-10 12 3 -2-1012 



GCR1 GCR1 

Fig. 3. Scatterplots of the response (— Evap,) versus the first two SIR directions (upper 
panels) and the first two GCR directions (lower panels) for the soil evaporation data. 
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ods in detecting complex regression surfaces — in this instance monotone in 
one direction and U -shaped in another. This was anticipated by theoretical 
analysis in Section 4 and supported by simulation studies in Section 6. 

8. Conclusions. The contour regression methods introduced in this pa- 
per have strength in several aspects. First, under mild conditions they achieve 
exhaustive estimation of the central subspace at the -^/n-convergence rate. 
In comparison with existing global estimators such as OLS, PHD and SIR, 
contour regression estimators are more comprehensive, capable of picking 
up all directions in the central subspace without relying on special response 
patterns (e.g., monotone or [/-shaped trends). In particular, GCR achieves 
exhaustiveness essentially without any assumption other than that of ellip- 
ticity of the distribution of X. Second, by design contour regression meth- 
ods are capable of exploiting interslice information which is not accessible to 
methods based on slicing. This partly explains their improved accuracy over 
SIR and SAVE, which we have discussed from an analytical standpoint and 
documented through simulations. In fact, we think that the advantage of 
contour methods over SAVE is due to the gain in efficiency such as achieved 
via the use of interslice information, and not the structural inability of SAVE 
to capture linear trends. Third, GCR achieves a degree of robustness against 
nonellipticity of the distribution of A". In this respect contour regression is 
akin to the adaptive methods mentioned in the Introduction. Unlike adap- 
tive methods, however, contour methods are computationally simple, the 




Fig. 4. A view of the 3D plot of the response (— Evap ) versus the first two GCR directions 
in the soil evaporation data. 
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level of computational burden being essentially that of principal compo- 
nent analysis. In particular, they do not require iterative maximization of a 
multivariate nonparametric function, which can be a substantial advantage, 
especially if the dimension is large, or if multiple local maxima are present 
in the iterative maximization. 

Because contour vectors are extracted according to a threshold on re- 
sponse variation, our methods are logically analogous to a one-dimensional 
kernel or nearest-neighbor estimator. If the distribution of the predictor X 
is elliptical, the threshold need not go to zero in our asymptotic arguments 
for SCR, which makes it possible to achieve the y/n-rate regardless of the di- 
mensions p and q. In this respect contour regression is similar to traditional 
global methods such as OLS, PHD, SIR and SAVE. However, if ellipticity 
fails and/or is difficult to establish through predictor transformations, we 
can employ a relatively small threshold, operating in a spirit more similar 
to that of adaptive methods. 

We do not claim that contour regression estimators will outperform other 
methods under all circumstances. For example, OLS is the maximum like- 
lihood estimator if the regression surface is linear and the error term is 
normal, and tends to perform very well if the surface is nearly linear or 
clearly monotone. Similarly favorable circumstances exist for PHD, SIR and 
SAVE as well. 

The ideas of contour regression raise many questions that have not been 
addressed within the the scope of this paper. In particular, the asymptotic 
properties of GCR, as well as test statistics for estimating the structural di- 
mension q, have not yet been developed. We do expect that -^/n-convergence 
can be achieved by GCR if the threshold c is taken as fixed, because this in 
effect includes in the computation a number of line segments proportional 
to the total number of observation pairs. We also expect that test statis- 
tics for determining q can be constructed based on Theorem 2.3, along lines 
similar to those in [2, 14]. Also, we have not provided a systematic method 
for choosing the thresholding constant c (or the ratio r) for SCR and GCR, 
as well as the tube radius p for GCR, which should ideally be based on 
data-driven criteria. Other useful developments will concern the asymptotic 
behavior of GCR when the threshold c is allowed to go to zero as the sample 
size n tends to infinity. Theorem 5.1 suggests that even without ellipticity 
of X the correct asymptotic behavior would still be guaranteed. However, 
in this case we do not expect a -^/n-convergence rate — at least not for all 
structural dimensions. To further improve efficiency it may be helpful to ex- 
periment with windows other than the current rectangular ones in selecting 
contour vectors. It may also be possible to apply an idea similar to local 
linear regression [11] to correct the possible edge effect caused by the line 
segments lying in the outskirts of the data cloud. Another worthwhile line 
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of research would be to extend contour methods to dependent data, for ex- 
ample, to weakly dependent Gaussian time series (see [23]). Finally, as we 
have seen from Example 2.3 (binary response), contour methods do apply to 
discrete numerical responses. Moving forward along this direction, we could 
generalize contour methods to ordinal categorical or even purely categorical 
responses. This will require appropriate renditions for the concepts of ab- 
solute difference (e.g., "the same" and "not the same") and distributional 
spread (e.g., a concentration index) for categorical modalities; these could be 
used to define contour direction identifiers for SCR and GCR, respectively. 
We leave these issues to future studies. 

APPENDIX 

Proof of Lemma 2.1. (a) Let denote the joint densities of (Vi, Vj), 
and so on. For example, f\ is the density of V\ and /123 is the joint density 
of (Vi, V2, V3). Similarly, let /^-i^ and so on denote conditional densities. For 
example, / 2 3|4 is the conditional joint density of (V2, V3) given V4. We need 
to show that 

(20) /l245|36 Ol , V 2 , U4, V 5 1 V 3 , V 6 ) = / 14 | 36 {v X ,V 4 \ V 3 , V G )f 25 \ 36 (t> 2 , V 5 \ V 3 , V 6 ) . 

Without loss of clarity we can omit v\ , . . . , v q from the density. Thus the 
above equality becomes /i245|36 = /i4|36/25|36- The left-hand side of (20) is 

/l23456//36- 

Because (Vi , V 2 , V 3 ) JL (V 4 , V 5 , V 6 ) , V\ JL V 2 \ V 3 and V 4 JL V 5 \ V 6 , the numerator 
in the above ratio is factorized into /113/2I3/3/4 e/sie/e = hzf-mftofaKhh): 
and the denominator is factorized into f 3 fe. Thus the left-hand side reduces 
to /i3/23/46/56/(/3/e) 2 - The right-hand side of (20) is the ratio /1436/2536/ 'f 36 - 
Because (Vi, V 2 , V 3 ) JL (V4, V5, V&) this ratio becomes fi 3 fmf23fm/{fzf&) 2 , 
the same quantity to which the left-hand side of (20) is reduced, 
(b) Suppose (Vi, V2)-1L (V3, V4). We want to show that 

(21) /l3|24 = /l|24/3|24- 

The left-hand side is /i234//24 5 which, because (Vi, V 2 ) JL (V 3 , V4), reduces to 
/12/34/C/2/4). The right-hand side of (21) is /i 2 4/ 3 24/(/24) 2 = fvhhhi/ {f2fi? = 
f 12/34/ {f'ifi), which completes the proof. □ 

Proof of Theorem 4.1. Because the proof is basically the same as 
that of Theorem 2.1, we only highlight the differences. There is no change 
in paragraphs 1, 2, 3, 4, 6 of the proof of Theorem 2.1 except for replacing, 
wherever applicable, \Y - Y\ < c by V(X,X) < c, K(c) by G(c), K x (c) by 

G^c) = E[(Z -Z)(Z- Z) T \V(Z, Z)<c], 
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and "Assumption 2.1" by "Assumption 4.1." 

Replace the fifth paragraph by the following argument: Because (Z, Y) 
and (4>i(Z),Y) have the same distribution, and because (Z,Y) and (Z,Y) 
are independent, the distributions of (Z, Y, Z, Y) and (4>i(Z), Y, (j>i(Z),Y) are 
identical. Hence 

E[(Z - Z){Z - Z) T \V(Z, Z) < c] 

= E[{4>i(z) - mz))(Mz) - Mz)f\v(Mz),Mz)) < c]. 

We claim that, for any a and b in M p , V(a, b) = V(4>i(a),<f>i(b)). By definition, 

V(a, b) = var(Y\Z = (1 - t)a + tb for some t). 

Because (Z,Y) and (cj>i(Z),Y) have the same distribution, the condition 
in the above conditional variance can be replaced by <j>i{Z) = (1 — t)a + tb 
or Z = c/> i " 1 ((l — t)a + tb). Because, as we have noted, fa = <f>~ 1 , and also 
because 4>i : MP i— >• MP is a linear function, the condition can be replaced by 
Z = (1 — t)(f>i{a) + t(f>i{b). Therefore, 

V(a,b) = y&T(Y\Z=(l-t)(j) i (a)+t(f) i (b) for some t) = V(<fc(a), &(&)), 

as claimed. Consequently, 

s[(z-z)(z-z) T |y(z,z) < c ] 

= s[(0i(z) - 0i(z))(^(z) - <Mz)f|y(z, ^) < c]. 

Now follow through the rest of the fifth paragraph in the proof of Theo- 
rem 2.1, replacing \Y — Y\ < c by V(X,X) < c in one place. □ 

PROOF of Lemma 3.2. By Fubini's theorem we have 

E(S)-E(T)= r r(^l-l]q(s)dsdt= [°° G{t)dt. 
Jo Jt \q{s) J Jo 

We now show that G(t) < for all t > 0. Because p(s)/q(s) is a decreasing 
function and because 

the function p(s)/q(s) is greater than 1 at s = 0, equal to 1 at some sq > 
and less than 1 afterward. Hence G'(t) = q(t) — p(t) is less than for t < sq 
and greater than for t > sq. So G(t) first decreases and then increases. 
However, it is easy to see that 67(0) = and lim^oo G(t) = 0. Hence G(t) < 
for all t > 0. □ 

Lemma A.l. Let S\ and S2 be two linear subspaces of MP and let A 
be an open set in MP containing the origin. Suppose An Si C AnS2- Then 
Sx cS 2 . 
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Proof. Let v be a vector in S\. Because A is an open set containing 
the origin, for sufficiently small A > 0, Xv £ A. Hence Xv G A n Si. Because 
AnSi C AnS2, Ai> also belongs to Ar\S2- Hence Xv belongs to S2. Because 
S2 is a linear subspace, « belongs to S2. □ 
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